Genomic survey maps differences in the molecular complement of vesicle formation machinery between Giardia intestinalis assemblages

Giardia intestinalis is a globally important microbial pathogen with considerable public health, agricultural, and economic burden. Genome sequencing and comparative analyses have elucidated G. intestinalis to be a taxonomically diverse species consisting of at least eight different sub-types (assemblages A-H) that can infect a great variety of animal hosts, including humans. The best studied of these are assemblages A and B which have a broad host range and have zoonotic transmissibility towards humans where clinical Giardiasis can range from asymptomatic to diarrheal disease. Epidemiological surveys as well as previous molecular investigations have pointed towards critical genomic level differences within numerous molecular pathways and families of parasite virulence factors within assemblage A and B isolates. In this study, we explored the necessary machinery for the formation of vesicles and cargo transport in 89 Canadian isolates of assemblage A and B G. intestinalis. Considerable variability within the molecular complement of the endolysosomal ESCRT protein machinery, adaptor coat protein complexes, and ARF regulatory system have previously been reported. Here, we confirm inter-assemblage, but find no intra-assemblage variation within the trafficking systems examined. This variation includes losses of subunits belonging to the ESCRTIII as well as novel lineage specific duplications in components of the COPII machinery, ARF1, and ARFGEF families (BIG and CYTH). Since differences in disease manifestation between assemblages A and B have been controversially reported, our findings may well have clinical implications and even taxonomic, as the membrane trafficking system underpin parasite survival, pathogenesis, and propagation.


Introduction
Giardia intestinalis is a protist pathogen responsible for the globally occurring water and foodborne diarrheal disease, Giardiasis.The World Health Organization estimates roughly 280 million cases, with greater incidence of disease occurring in resource poor regions with limited access to clean drinking water and sanitation infrastructure [1].Ingestion of Giardia cysts contaminating food and water leads to establishment of gut infection.Chronic and recurring infections in children have developmental consequences and can extend beyond the acute intestinal disease.
Although morphologically identical, multi-locus sequence genotyping has sub-divided G. intestinalis into eight distinct assemblages (i.e., genotypes), A through H, with broad animal host tropism, including humans [2].Of these, A and B have the greatest host range and are the predominant assemblages that infect humans to cause human giardiasis.Despite similarities in their ability to cause human disease, phylogenetic analyses place assemblage A closely related to the cattle-infecting assemblage E and cat assemblage F. Meanwhile, assemblage B is closer in relation to assemblage G, which infects murine hosts such as mice and rats [3].
Advances in genome sequencing technologies have resulted in an abundance of new genomic data from numerous human and animal-infecting Giardia isolates for comparative analyses.Of the human-infecting isolates, assemblage A, isolate WB was the first available genome and lent tremendous insights into molecular-level losses and sequence divergence in this parasite compared to other eukaryotes [4].Since then, other assemblage A isolates, such as AS175 and AS98, as well as assemblage B isolates, GS, GS_B, and BAH15c1, have also been sequenced and made available [5][6][7][8].Recent advances in long-read sequencing technologies have also allowed cost-efficient improvements of these previous assemblies [8][9][10].Other animal-infecting isolates from dog assemblages, C and D, and cattle assemblage E have also been sequenced and assembled, which have shed light on the molecular differences between the various Giardia assemblages [11,12].Investigating these additional animal isolates is necessary due to their implications on potential for anthropozoonotic transmission between humans and domesticated animals [13].
Fundamental biological and genetic differences have been enumerated between the two human-infecting strains.Although structurally otherwise identical, various cytogenetic differences are also present, such as the number of chromosomes per nuclei and their sizes [14,15].Although generally lower compared to other polyploid eukaryotes, genetic allelic heterozygosity (ASH) comparisons between assemblages A and B have shown variances in the levels between the two.Assemblage A isolate WB is markedly lower in overall ASH (<0.01%) compared to assemblage B isolate, GS, which has a much higher genomic allelic divergence (0.5%) [16].
From a clinical standpoint, differences in human Giardiasis outcomes caused due to assemblages A and B are also variable and often conflicting.These reports have been documented through several cross-sectional clinical studies conducted across numerous countries (e.g., United Kingdom, Scotland, Sweden, United States, Saudi Arabia, Egypt) to understand if specific strains are attributable to a higher probability of developing symptomatic or chronic disease outcomes [17][18][19][20][21].While the precise impact of molecular-level attributes and variances within Giardia assemblage biology on the manifestation of symptoms and disease remains uncertain, gaining insight into the underlying parasite biology and genomic content of these assemblages is undeniably valuable for a more comprehensive understanding of this discourse.
The increased availability of genome sequencing data from numerous assemblage A and B isolates previously enabled large-scale and multi-family comparative genomic investigation and has brought forward growing evidence that suggests considerable inter-assemblage variability at the genetic level.Much of this has been found within multiple Giardia-specific virulence gene families, such as the cysteine-rich variant-specific surface proteins (VSPs) expressed by the trophozoites during a gut infection for parasite antigenic variation in order to modulate and evade the host immune cells and other high-cysteine proteins (HC) [22].Recent genome-wide surveys have highlighted that isolates belonging to assemblage B encode upwards of 340 different VSPs compared to their assemblage A counterparts, which have smaller VSP repertoires (i.e., 136 in WB) [9,23].Most notably, the assemblages show high proportions of assemblage-specific VSP and HCs [9].
Differences in the other encoded protein machinery, particularly those belonging to the membrane trafficking system pathways, are also evident.Previous comparative genomics and phylogenetic investigations performed by us and others showed the complement and evolution of the ESCRT proteins, vesicle coats, and the ARF regulatory system proteins.Within each of these systems, consistent patterns of inter-assemblage variabilities were identified that would have substantial implications on the molecular intricacies underpinning the processes occurring at the parasite endomembrane organelles.Given the small number of isolates from each of the two human-infecting assemblages included in these studies, a deeper investigation at a population level is necessary to confirm the initial findings of intra-assemblage differences and to assess what variability exists, if any within assemblages, i.e. at the intra-assemblage level.
Large-scale genomic data from assemblage A and B isolates are therefore essential to evaluate these trends.Recently, the British Columbia Centre for Disease Control Public Health Laboratory (BCCDC PHL) sequenced 89 G. intestinalis isolates belonging to either assemblage A or B [24].Archived isolates were collected from fecal samples, surface waters, and beavers during periods of sporadic and regional Giardiasis outbreaks in across the province of British Columbia (Canada) between the years 1989 and 1995 as well as some from Hamilton, Ontario (Canada) [25].A combination of PCR and genome sequencing classified these outbreak-associated Giardia isolates within assemblage AI, AII, and B [24,25].In this study we leveraged the short-read sequencing data generated by the BCCDC PHL by first assembling the Giardia genomes, followed by a population-scale survey of the molecular complement with three important families of vesicle formation machinery, the ESCRT complexes, vesicle coats and adaptor proteins, and the ARF regulatory system proteins in these large collection of Giardia isolates [24].

Paired-end read information and data retrieval
De novo genome assembly analyses with the BCCDC PHL isolates of G. intestinalis assemblage A and B were performed using the previously sequenced and archived paired-end inserts that were generated using the Illumina MiSeq.Cysts belonging to each isolate had been previously obtained from surface water, beaver, and human fecal samples and archived by the BCCDC PHL [24,25].Detailed geographical sources for isolate retrieval and assemblage classification were provided through Tables 1 and S1 in the previous study by Tsui et al [24].
Raw sequencing reads are available at National Centre for Biotechnology Information (NCBI) Short Read Archive under the BioProject ID PRJNA280606.We downloaded the sequencing data from the European Nucleotide Archive (The European Molecular Biology Laboratory, Heidelberg, Germany).Metadata detailing sequencing run statistics and biosample accessions for the 89 isolates is provided through the S1 Table as well as in the previous study [24].

de novo genome assembly
Kraken2 analyses flagged three datasets to be highly contaminated with bacterial sequences and three with mixed assemblage taxonomic assignment.These six samples were not analyzed further.The remaining 83 paired-end filtered datasets were kept for de novo genome assembly using the ploidy-aware assembler MaSuRCA v.3.3.0, using 200,000,000 hashes from Jellyfish v.3.2.4 to generate contigs and scaffolds  in FASTA format.MaSuRCA configuration script has been made available as S2 File [29].

Genome assembly completeness and contiguity evaluations
The resulting nucleotide assemblies were subject to post-assembly quality assessments by computing genome contiguity and completeness statistics.Contiguity was assessed using the N50 metric with the Quality Assessment Tool for Genome Assemblies (QUAST) v. 5.0.2, which also determined other contig statistics such as total length in base pairs and the final percent GC for each assembly [30].This detailed report is made available through S3 Table.
In addition to genome contiguity, to assess gene set completeness and report on the number of evolutionarily conserved near-universal single-copy orthologs, Benchmarking Universal Single-Copy Orthologs (BUSCO) v. 4.1.1 was used as a tool for a translated Hidden Markov Model-based searching against OrthoDB v.10 [31,32].Run parameters for BUSCO consisted of the following: lineage dataset selection was eukaryote_odb10, analysis mode set to 'genome,' and TBLASTN e-value cut-off was kept as the default 1x10 -3 .All other BUSCO parameters, including those for the in-built AUGUSTUS v. 3.2.3gene prediction software, were kept to default.All BUSCO output scores were plotted and visualized using the ggplot2 library available through the R Tidyverse package [33].Data visualization specifications were provided through an R script, which is available as S3 File [34].

Reference-mapped gene predictions and functional annotation
The nucleotide contigs from all isolates were used to predict gene features and annotations to screen for protein orthologs.To do so, Liftoff v.1.5.1 was used for annotation mapping from closely related reference genomes [35].
Of the currently available assemblage A and B genomes, those belonging to G. intestinalis assemblage A, isolate WB (UU_20 assembly, GCA_011634545.1) and assemblage B, isolate GS (GL50801 assembly, GCA_011634595.1) have been richly annotated and publicly available [5,8].Therefore, UU_20 and GL50801 assemblies were used as references for gene prediction in the A and B assemblage genomes, respectively.Input and reference genomes were first indexed and aligned using Minimap2 v. 2.17 using the following parameters: -a -eqx-end-bonus 5 -N 50 -p 0.5 [36].Genes were successfully aligned if the coverage along the coding sequence (CDS) met a threshold of 50% or higher in sequence similarity.They were used to generate output GFF files detailing coordinates, accessions of the mapped ORFs, and accompanying gene features.Unmapped gene accessions were parsed out in a separate file.Program usage parameters were specified as per Liftoff instructions to generate corresponding GFF files and the per-sample Liftoff mapped and unmapped summaries have been made available through S4 Table .Genome output files were used for presence/absence analyses of the vesicle formation machinery in isolates of Giardia assemblage A and B, previously curated through comprehensive comparative genomics and phylogenetics in preceding investigations [37][38][39][40].Specifically, WB and GS accessions corresponding to subunits of the giardial ESCRT machinery, adaptins, retromer, COPII, COPI, clathrin, and the ARF regulatory system proteins, were searched into the newly generated GFF files.To consolidate these preliminary hits, assign orthology to any unmapped genes, and cross-identify and validate assemblage-specific orthologs/paralogs, TBLASTN (v.2.2.29+) searches, with an e-value threshold set to 0.01, were also performed [41].Orthologs of the identified vesicle formation machinery from G. intestinalis ADH, WB, GS, and GS_B were used as queries for homology searching into the contigs and scaffolds of all newly assembled genomes.Results of this survey are summarized as tile-plots, but specific scaffold locations from the TBLASTN hits have been made available through S5 Table .Using these identified ORFs, we extracted the nucleotide sequences and performed their translation using Bedtools getfasta (S4 File).Both nucleotide as well as translated amino acid coding regions from each trafficking protein analyzed are summarized in S6 Table.

Genome contiguity and completeness analyses yield uniform results across all isolates
Previously, Tsui and colleagues performed genome sequencing and assembly with 89 G. intestinalis assemblage A and B isolates archived at the BCCDC PHL to evaluate specific disease transmission dynamics underpinning the 1980s Giardiasis outbreak in British Columbia, with some of the isolates originating from New Zealand and Ontario to be used as reference [24].Although the sequenced short paired-end reads from this analysis are publicly available, the assembled genomes are not.Therefore, it was necessary to perform de novo genome assembly prior to comparative genomic analyses with the vesicle formation machinery, as per the summarized workflow (Fig 1).To do so, read taxonomic classification was performed first to assess contamination and remove sequences that did not correspond to Giardia (Fig 1).Of the 89 initially retrieved datasets, six were removed from final genome analyses, either due to Steps included read-quality assessment and filtering using Kraken2 and FastQC, followed by de novo genome assembly using MaSuRCA.Assembled contigs and scaffolds were used to determine contiguity and completeness using metrics such as N50, the total number of contigs, and BUSCO scores.The resulting contigs were used for reference-based genome annotation using Liftoff for orthology mapping of genes and presence/absence analyses of the vesicle formation proteins.False negatives and existing orthology assignments were cross-validated using TBLASTN.https://doi.org/10.1371/journal.pntd.0011837.g001considerable levels of bacterial contamination or if the samples were characterized as 'mixed' assemblages.In the latter case, according to Tsui et al. [24], a mixed assemblage assignment was denoted if the sequencing reads mapped equally to both Giardia assemblage A and B. This does not imply existence of a genetically hybrid strain but is instead a by-product of gDNA contamination or its isolation from mixed cultures containing trophozoites belonging to both strains.Therefore, these data were treated as metagenomic.Because this investigation aimed to elucidate nuanced molecular-level differences between the two assemblages, samples with 'mixed' assemblage classification were also removed from analyses.This resulted in a final total of 83 datasets being used for downstream analyses, 42 of which were classified as assemblage A and 41 as assemblage B (S5 and S6 Tables).
N50 scores and contig sizes for all assemblies were determined using QUAST.The results of this analysis calculated an average N50 of 70,543 bp and a median of 74,269 bp across all assemblies (S1 and S3 Tables).Genomes were also assembled in an average of 382 contigs that were � 1kb in size (S1 and S3 Tables).In addition to genome contiguity, completeness was also assessed by determining the number of evolutionarily conserved single-copy orthologs of eukaryotic proteins by generating BUSCO scores where overall the percent BUSCO of complete plus fragmented genes was approximately 25 to 30% across all assemblies (Figs 1 and S1).While these are much lower than what is typically expected of a eukaryotic genome from animal, plant, or fungal lineages (i.e., 85-95%), low BUSCO scores in Giardia are a drawback of limited inclusion of diverse protist lineages within the eukaryote_odb10 database.In general, metamonad lineages are highly divergent in their sequences and ancestrally lack many of the proteins present in model eukaryotes, and therefore all suffer from poor BUSCO scores [42][43][44].Instead of using these as absolute measures, BUSCO scores were evaluated against those published for other Giardia assemblies.The short-read reference genome of assemblage A, isolate WB (GCA_011634545.1)was previously determined to have a BUSCO score of C:23.9% (S:0%, D:23.9%),F:5.9%, M:70.2%, n = 255, while the short-read reference genome from assemblage B, isolate GS has a BUSCO completeness score of 23.1% (https://www.uniprot.org/proteomes/UP000001548; https://www.uniprot.org/proteomes/UP000002488)[4,6].Therefore, our re-assembled Canadian Giardia genomes are comparable, if not slightly more complete, in their BUSCO scores to those of the previously published short-read reference genomes belonging to assemblage A isolate WB and assemblage B isolate GS.

Overall GC content and genome sizes are comparable to other isolates of assemblages A and B
Aside from genome completeness and contiguity, the percent GC content [%GC] and genome size metrics for these new Giardia assemblies were also determined.
Of the 83 isolates assembled and kept for analyses, 42 belonged to assemblage AI or AII, and 41 to assemblage B. Previous genomic investigations with assemblage AI and AII isolates (i.e., DH, WB, and AS175) were approximated to be 48% GC rich [4,5,45].A similar trend was observed across the 42 BC G. intestinalis assemblage A isolates, where the %GC across all isolates ranged between 47.8 and 48.9 (Fig 2B and S3 Table).A few outliers had slightly higher scores (i.e., 49 to 49.96%), but overall, mean and median %GC for assemblage A were 48.5% and 48.36%, respectively (Fig 2C and S3 Table).Similar to assemblage A, previously investigated assemblage B isolates also had %GC ranging between 47 and 49 (i.e., GS, GS_B, and BAH15c1) [5][6][7].The results from this study are identical to those values, wherein the 41 BC B isolates had 47 to 49%GC, with a mean of 48.7% and a median of 48.9% (Fig 2A and 2C and S3 Table).
Genome sizes were also evaluated and compared against the previously published Giardia genomes.In the literature, inter-assemblage variances within the overall genome sizes of assemblage A and assemblage B isolates have been noted, wherein assemblage AI and AII isolates (i.e., WB, DH, AS175, AS98, ISS17, and ZX15) generally ranged between 10.2 Mbp and 11.7 Mbp [4][5][6]9,45].In contrast, assemblage B isolates (i.e., GS, GS_B, and GS/M clone H7) were comparatively larger, ranging between ca.11 Mbp to 13 Mbp [4][5][6]9,45].This trend was also consistent in this investigation.Apart from one outlier, the genome size range for all assemblage A isolates ranged between ca.10.6 Mbp and 11.9 Mbp, with an average of 11.0 Mbp and a median of 10.9 Mbp (Fig 3B and 3C).Assemblage B isolates, on the other, were larger and ranged between 10.8 Mbp and 13.7 Mbp, with an average of 12 Mbp and a median of 11.9 Mbp (Fig 3A and 3C).The reference isolate GS has a genome size of 12 Mbp, whereas GS_B is 13 Mbp in size.Therefore, assemblies from this study are similar in size compared to the previously published isolates from the same assemblage (Fig 3A and 3C).The GC content and genome size's collective examination yielded identical values as the previously published pan-global isolates.These findings additionally validate assembly correctness and corroborate the previously observed trends at a greater population level.Although the two assemblages are similar in their overall %GC content, isolates belonging to assemblage B are consistently larger than A.

Gene predictions and functional annotations suggests close similarity to reference genomes
Past analyses have identified some differences in membrane-trafficking machinery complement between assemblages A and B [39][40][41].However, these trends were based on a limited number of samples (three to five) per assemblage, and so a much larger sampling is needed to confirm the results.Moreover, the question of whether there are intra-assemblage differences within A or B vesicle coat machinery (Adaptins, COPI, and Retromer), COPII subcomplex, ESCRT protein complexes, and the ARF regulatory protein machinery has not been assessed at a population-level and as a result, survey of these proteins was performed with these 83 Canadian genomes to comprehend the trends previously observed in the repertoire of the endo-lysosomal vesicle formation machinery.
To do so, we first used Liftoff to map gene features and annotations from closely related species of organisms by providing the genomes from assemblage A, isolate WB (UU_20 assembly; GCA_011634545.1) and assemblage B, isolate GS (GL50801 assembly; GCA_011634595.1) as references for mapping.Many of the annotations have been curated based on experimental evidence (i.e., molecular protein characterizations, transcriptomics, and proteomics) and are regularly updated by the Giardia research community.To supplement and cross-validate Liftoff annotations, TBLASTN analyses were also performed for genes of interest to eliminate false-negative absence artifacts due to improper or missed gene mapping.
As per recent estimations, the newly updated genome of G. intestinalis WB (UU_20 assembly) is predicted to encode 4,963 protein-coding genes and 85 non-coding RNA (ncRNAs) genes, yielding a total of 5,048 genes [8].From this total, Liftoff mapped an average of 4,688 genes from the BC assemblage A genomes, with approximately 72 remaining unmapped (S4 Table ).Similarly, albeit slightly lower, the total number of protein-coding and ncRNAs in the assemblage B reference GS genome (GL50801 assembly) is estimated at 4,470 and 92, respectively, resulting in a total of 4,562 predicted genes.An average of 4,481 genes were mapped across the BC assemblage B genomes, with 62 remaining unmapped (S4 Table ).Unmapped genes may have resulted from sequence divergence or fragmented orthologs that did not share enough similarity with the target for a Minimap2 cut-off of 50%.The other possibility is that a subset of these unmapped genes are protein repertoires exclusive to the reference WB and GS genomes.Nonetheless, these findings suggest that for both assemblages A and B, approximately 98.2% of the genes and annotations from WB and GS genomes were successfully mapped, and therefore, encoded in the genomes of the newly assembled isolates.

ESCRT repertoire varies between assemblage A and B isolates but is consistent within assemblages
Previously, comparative genomic and phylogenetic analyses with the late-endosomal ESCRT subcomplexes were performed by us to trace their evolution across fornicates, including several human-infecting isolates of Giardia [39].Here patterns of universal streamlining within ESCRTs were observed across the entire Giardia genus, such as the complete loss of the ESCRTI subcomplex [39].However, aside from absences, Giardia-specific duplication events also occurred to yield several paralogs in the following subunits: ESCRTII-Vps36 (i.e., Vps36A, B, and C), ESCRTIII-Vps24 (i.e., Vps24A and Vps24B), ESCRTIIIA-Vps4 (i.e., Vps4A, B, and C), ESCRTIIIA-Vps46 (i.e., Vps46A and B), and ESCRTIIIA-Vps31 (i.e., Vps31A, B, and C) [39].Comparisons between various pan-global isolates revealed distinct differences within the repertoire of the individual subunits and the number of paralogs that comprise the giardial ESCRTII, ESCRTIII, and ESCRTIIIA.To better elucidate these interassemblage differences within ESCRTs and to assess if these trends exist at a population level, the repertoire of the giardial ESCRT machinery in the newly assembled BC Giardia genomes was reconstructed.This was done by surveying the Liftoff annotations combined with TBLASTN searches into the nucleotide contigs and scaffolds.Overall, our findings confirm and extend past results of ESCRT retention and loss within and between assemblages.
All 42 assemblage A isolates are conserved in their Giardia repertoire of the ESCRT complexes without any variabilities at the individual isolate level (Fig 4A).Contrary to this, previously noted streamlining exists in several ESCRT machinery components in assemblage B isolates (Fig 4B).The most prominent of these was the loss of Vps20L.The survey of 41 new assemblage B isolates confirms that this crucial ESCRTIII component is universally absent across all sampled Giardia assemblage B genomes (Fig 4B).TBLASTN searches using Vps20L orthologs identified in pan-global assemblage A isolates (i.e., DH, WB, and AS175) were used as queries to rule out false-negative artifacts.Despite this approach, no protein hits resembling Vps20L orthology were identified.Similar trends were observed within one of the three Vps4 paralogs (i.e., Vps4C), which also remained unidentified in the pan-global isolates of assemblage B. Using both Liftoff and TBLASTN, only Vps4A and Vps4B were retrieved across all 41 isolates (Fig 4B).Therefore, the absence of Vps20L and Vps4 points towards a global loss of these components in this assemblage.
Altogether, these findings validate previous observations of molecular differences within the endo-lysosomal ESCRT subunits between the two assemblages and allow for an extension of those conclusions at a greater population level.

Vesicle coat complexes follow a pattern of inter-assemblage molecular differences
As with the ESCRTs, the evolution and the molecular complement of the heterotetrameric complexes (adaptins and COPI) has been addressed [38].To consolidate those findings as well as to identify any other possible isolate-level differences, the Giardia complement of heterotetrameric adaptor complexes (HTACs) from 83 BCCDC assemblage A and B genomes were determined.
Unexpectedly, two instances of gene absences within components of the AP-1 subcomplex in genomes of assemblage B were noted.First was the lack of AP-1μ in the SRR1957167 assembly and the second absence was within the AP-1σ subunit in the VANC/96/UBC/129 assembly (Fig 6B).TBLASTN searching using AP-1μ and AP-1σ orthologs from other assemblage B genomes could not identify μ1 or σ1 subunits in these genomes other than those belonging to the AP-2 sub-complex.The absence of AP-1μ and AP-1σ, although surprising, is highly unlikely and should be ascertained as a technical assembly or sequencing-related issue.AP-1μ is absent from the SRR1957167 assembly belonging to assemblage B, which is an outlier as it has a comparatively smaller genome size (10.9Mbp) and lower N50 (44, 795 bp) with its counterparts.Both attributes are likely a result of low starting sequencing depth compared to the other isolates and hence a cause for gene absence artifacts.In the case of AP-1σ, which is missing from the VANC/96/UBC/129 assembly, may have been due to low coverage k-mers in this region and therefore discarded during the assembly process.These are more likely scenarios than instances of actual loss, as the remaining 39 isolates from assemblage B encode both adaptin subunits.
The molecular complement of COPII, retromer, and clathrin have not been examined at the same depth and as recently as the above systems and therefore were examined here.For nearly all components we found the identical repertoire across Retromer, Clathrin, and most of COPII.Unexpectedly, we find multiple paralogs of the COPII component Sec24, which we have termed Sec24A, B, and C across all 41 assemblage A genomes (Fig 5).However, COPII-Sec24C was not found in any of the assemblage B genomes, despite searching by TBLASTN analyses using assemblage A Sec24C sequences (Fig 5B ).

Paralogues of the ARF regulatory system proteins continue to differ between the two assemblages
Finally, although ancestral losses shaped the fornicate ARF regulatory system, surprisingly tight conservations within the overall complement of ARF1, ARF GAPs, and ARF GEFs exists in Giardia and its free-living relatives [40].Despite this, there remained critical differences in the encoded proteins of the two assemblages.Namely, paralogs of the fornicate-specific ARF1F proteins and Sec7 domain-containing ARF GEFs differed [40].
In this previous survey, of the three fornicate-specific ARF paralogues traced in G. intestinalis, assemblage A possessed all three ARF1 variants (i.e., ARF1FA, ARF1FB1, and ARF1B2), identified, whereas a secondary loss within one of the two ARF1FB paralogs (i.e., ARF1FB2) was identified in the 41 assemblage B examined [40].Extending this to the 83 genomes, this was confirmed, whereby all new assemblage A genomes possessed the three paralogs, while ARF1FB1 was universally absent across all 41 assemblage B genomes (Fig 7B).Unlike the previously discussed systems where absences were strictly observed in assemblage B, that was not the case with the ARF regulatory system proteins.Assemblage A has been reported to lack one of the two BIGL proteins, while assemblage B did not possess one of the two Cytohesin identified in assemblages A and E [40].Once again, the population-level survey confirms the absence of one of the two BIGL proteins from assemblage A and CYTHL from assemblage B (Fig 7), but no intra-assemblage variability.

Discussion
Past molecular evolutionary work has identified distinct differences within subunits of the trafficking complexes encoded by the two human-infecting G. intestinalis assemblages.Because the previous sampling was limited to few genomes, a population-scale investigation to better define the heterogeneity within these protein complements was necessary.The assembly of 83 full-length genomes belonging to Canadian isolates of G. intestinalis assemblage A and B permitted a large-scale comparative analysis of the vesicle formation machinery.

Genome size differences corroborates observed inter-assemblage cytogenic differences
Because it was assumed that the two human-infecting assemblages would be genetically distinct strains, an impartial examination into the differences between the trafficking complement and other genomic attributes (i.e., genome size and GC content) could have only been made appropriately through a de novo approach.Therefore, a de novo, instead of referencealigned assembly, approach was pursued for genome assembly to limit reference bias and account for possible structural and sequence variants.Using MaSuRCA, we assembled a roughly equal number of genomes belonging to both assemblage A and B isolates (i.e., 42 assemblage A and 41 assemblage B genomes) for comparative assessment of %GC and genome sizes between the isolates analyzed in this study alongside publicly available reference genomes of WB and GS.
Although we did not observe major variability within the %GC content between assemblages or individual isolates, inter-assemblage differences exist within the overall genome size.The Canadian assemblage A isolates from this investigation ranged between 10.6 to 12.3 Mbp in size similar to other public assemblage A isolates (i.e., 10.3 to 12.08 Mbp) [4,5,46].In contrast, Assemblage B genomes from our study (i.e., 11 to 13.7 Mbp) and those publicly available (i.e., 10.4 to 13.6 Mbp) were comparatively larger.This could be corroborated by pulsed-field gel electrophoresis studies which have allowed for the separation and comparative assessment of molecular weights.Analyses of the five Giardia chromosomes (Chr) belonging to isolate WB and GS revealed distinct differences where WB encodes Chrs that are 1.5, 2, 3, and 3.5 Mbp in size while GS has a slightly larger Chr1 (1.8 Mbp) due to a recombination event with Chr2 [47,48].Optical mapping between these five chromosomes has also identified considerable structural rearrangements, inversions, and translocations [5].Notably, even within the syntenic regions, only 70% sequence similarity was present between GS and WB.Comparisons between the number of protein-coding genes were also dissimilar, where ortholog overlap analyses identified 2962 heterologous open reading frames (ORFs) in assemblage B isolate GS_B.In contrast, assemblage A isolates, DH and WB, were reduced in this number by half, where only 1935 and 1067 unique ORFs were identified, respectively [5].

Potential impacts of missing vesicle formation machinery on secretory processes between the two assemblages
Previous functional work has determined the role of the vesicle formation machinery within Giardia to be critical to the parasite's ability to facilitate secretory and uptake processes in trophozoites and encysting cells [39,[49][50][51][52][53][54].Therefore, divergences in its molecular complement could implicate crucial functional variabilities within giardial trafficking processes at these organism-specific compartments.
Within ESCRTs, variations in ESCRTIII-Vps20L and ESCRTIIIA-Vps4 were identified.Cellular localization investigations by us and by previous others in the field have lent unparalleled insights into the promiscuity of ESCRTs at different giardial endomembrane compartments.Giardia ESCRTIII-Vps20L was primarily localized to the ER in conjunction with numerous ESCRT subunits [39].While the specific dynamics of inter-ESCRT association by sequential recruitment onto various organellar membranes are currently unknown in Giardia, some conserved aspects such as cargo recognition and membrane remodelling must still exist at those different compartments.Therefore, the absence of critical subunits such as Vps20L in assemblage B isolates implies possible functional compensation mechanisms to fulfill those same roles or that there is divergence in the underlying pathway wherein Vps20L is simply not necessary in this assemblage.While the former scenario may very well be the case for the missing paralogs such as Vps4C, which may be having redundant cellular roles as other Vps4 proteins at the peripheral vacuoles in assemblage A isolates, the latter scenario could signify crucial biological differences.Differences in paralog functions and molecular association of the giardial trafficking machinery, including the ESCRTs (i.e., Vps4 and Vps46), have been hinted at previously and therefore is a plausible scenario, but one that still requires robust molecular and biochemical testing [55].Nonetheless, these results posit a potentially differentiated underlying mechanism by which the existing ESCRT subunits function within the Giardia trophozoites of the two assemblages.
Like the ESCRTs, although most vesicle coats (i.e., adaptins, COPI, and retromer) are conserved in their complement between the two assemblages, differences were noted within the COPII components specifically in the number of lineage-specific paralogs of Sec24.Giardial COPII machinery is fundamental to ESV biogenesis and maturation to ensure the transport of cyst-wall material to the parasite surface [49,56].Sec24, along with Sec23, forms pre-budding complexes upon cyst-wall protein recognition for a COPII-coat assembly at the ER-exit sites [49,57].In the absence of one of the three Sec24 paralogues, it could be that the remaining two Sec24 proteins may be fulfilling this role in assemblage B isolates and that plasticity within this system exists to accommodate losses such as this one.A contrary scenario may also be possible, wherein absence in Sec24 paralogs could indicate differences in COPII-assembly dynamics.All fornicates lack Sec16, which typically stabilizes Sec23/Sec24 and Sec13/Sec31 coats during the vesicle budding process.Therefore, ancestral neofunctionalization of these lineage-specific paralogs may have occurred to compensate for missing Sec16 roles.A secondary loss of this protein in Giardia assemblage B could then have implications on the molecular dynamics of how the COPII-coat is stabilized in those parasites, which in turn could translate to altered ESV-biogenesis mechanisms or cyst-wall material trafficking.These postulations provide hypotheses for testing in encystation experiments in assemblage B models versus assemblage A. Finally, differences in the ARF regulatory system proteins, especially the ARF1F GTPases, again may indicate functional redundancy of these paralogues in assemblage B at the PECs.Differences in the GEF complement may indicate variable regulation of the giardial ARFs.While testing these scenarios through fluorescent microscopy and proteomics experiments in the trophozoite stages of assemblage B isolates would be interesting, it is challenging to generate and establish parasite cultures of transgenic variants of assemblage B isolates, as they are slow growing and not amenable to stable or episomal integration of plasmids for recombination protein expression.Once technical advancements are made in this area of Giardia biology, these experiments should be pursued as they will open avenues to elucidate novel ARF biology and GTPase regulation mechanisms in Giardia and eukaryotes in general.
During revisions of this manuscript, Klotz and colleagues released 5 assemblage B genomes [9].To determine whether the more recent and comprehensive iterations of the five publicly available assemblage B genomes (P387, P424, P458, P344, and P427) encode the absent paralogs of ESCRTIII (Vps20L, Vps4C), COPII (Sec24C), and ARF (ARF1B2, CYTHL) family proteins identified in assemblage A, we conducted both forward tBLASTn and reverse BLASTx searches on the nucleotide scaffolds from the Klotz et al study [9].Our searches failed to retrieve correct orthologs of these proteins in reverse BLASTx, consistent with the findings in this study.It is important to note that only scaffolds from this study are currently available, therefore, we refrain from making definitive conclusions of absences which can only be made once these similarity searches at the protein-level are performed if protein predictions of these genomes become available in the future.

Giardia as a species-complex
Most tantalizingly, our findings support previously reported genomic and molecular complement-level differences at the population level.The definition of what constitutes 'species' remains a hotly debated philosophical discussion within the fields of ecology, evolution, and medicine.These lines are especially blurred in microbiology.Historically, the biological species concept has defined two organisms to belong to the same species if, through sexual reproduction, they can produce viable progeny [58].There is limited evidence for sexual reproduction or inter-assemblage genetic exchange in Giardia.Therefore, their species status under the strict umbrella of the biological species concept remains contested.On the other hand, the phylogenetic species concept defines species as a group of organisms with shared and unique evolutionary histories that possess a defined set of traits [59].Phylogenetic placement using multilocus sequence genotyping places G. intestinalis assemblages A and B as paraphyletic lineages nested within other animal-infecting assemblages [3,60,61].This implies that zoonoses for human pathogenesis occurred convergently.Combined with these phylogenetic classifications, differences in genome attributes and encoded protein repertoires strengthen the notion that G. intestinalis assemblage A and B (and all other assemblages) can be defined as different species as per the phylogenetic-species concept.Additionally, previous genome comparisons estimate only 77% nucleotide-level sequence conservation and 78% amino-acid similarity in the protein-coding regions between assemblages A and B isolates [6].Genome recombination analyses performed on assemblage A, B, and E isolates further supported this notion [62].Therefore, from evolutionary, molecular, and genomic standpoints, the combined evidence further support the idea that G. intestinalis assemblages as separate species, as has been proposed by many other Giardia genomic and molecular biologists [9,[63][64][65][66].
We must acknowledge a notable limitation in our study in that our current approach, reliant on Liftoff for genome annotation using the WB and GS references, is constrained in its ability to perform de novo annotation and comprehensive pan-genome analyses, encompassing both core and accessory genes.Recent frameworks towards a pan-genome analysis have been introduced by Seabolt and colleagues which will be a valuable way forward for the Giardia genomic community in performing similar comparative analyses [65].This alongside increased availability of contiguous and more polished genomes as reference assemblies will be valuable towards understanding molecular variances between different species and substrains of Giardia with an increased level of sensitivity and specificity [9].Nonethless, our findings are that there exist limited but notable differences between the two human-infecting Giardia assemblages, but little to no variability within the Canadian isolates that we examined.These genomic differences, and unity respectively, point to potential cell biological explanations of membrane-trafficking and bolster the argument that these may in fact be cryptic species.Regardless, the 83 genome assemblies will provide a wealth of new genomic data for biologists to explore.

Fig 1 .
Fig 1. Workflow for de novo assembly and reference-based gene predictions.Steps included read-quality assessment and filtering using Kraken2 and FastQC, followed by de novo genome assembly using MaSuRCA.Assembled contigs and scaffolds were used to determine contiguity and completeness using metrics such as N50, the total number of contigs, and BUSCO scores.The resulting contigs were used for reference-based genome annotation using Liftoff for orthology mapping of genes and presence/absence analyses of the vesicle formation proteins.False negatives and existing orthology assignments were cross-validated using TBLASTN.

Fig 2 .Fig 3 .
Fig 2. Percent GC content comparisons between assemblage A and B BCCDC PHL genome assemblies.[A]depicts %GC for the 41 isolates belonging to assemblage B. These ranged between a minimum of 47.7% and a maximum of 49.3%.[B] depicts the %GC content for assemblage A isolates, which, like assemblage B, ranged between 47.8 and 49.7.[C] is a box-plot depiction of the percent ranges and calculated means for GC content for assemblage A and B isolates, where a similar overall % GC was present for both [i.e., ca.48%].https://doi.org/10.1371/journal.pntd.0011837.g002

Fig 4 .
Fig 4. Tile-plot depictions of the ESCRT repertoire in the BCCDC PHL isolates.[A] depicts the giardial ESCRT repertoire distribution in the newly assembled BCCDC PHL isolates compared to the two pan-global reference assemblage A isolates, WB [AI] and DH [AII].No absences were identified within any assemblage A isolates.[B] shows the distribution of the giardial ESCRT repertoire in the newly assembled BCCDC PHL assemblage B isolates compared to the two pan-global reference assemblage B isolates, GS and GS_B.Assemblage-specific losses were identified within the ESCRTIII-Vps20L and ESCRTIII-Vps4C, and are consistent with the findings in the pan-global isolates, indicated in grey.https://doi.org/10.1371/journal.pntd.0011837.g004

Fig 5 .
Fig 5. Tile-plot depictions of COPII and retromer repertoire in the BCCDC PHL isolates.[A] depicts the distribution of the giardial COPII and retromer components in the newly assembled genomes of the BCCDC PHL isolates, compared to the two pan-global reference assemblage A isolates, WB [AI] and DH [AII].No absences were identified within any of the assemblage A genomes. [B] depicts the distribution of the giardial COPII and retromer repertoire in the newly assembled BCCDC PHL assemblage B genomes and compared with the two pan-global reference assemblage B isolates, GS and GS_B.Assemblage Bspecific losses are evident within one of the paralogs of COPII-Sec24FII [Sec24C].https://doi.org/10.1371/journal.pntd.0011837.g005

Fig 6 .
Fig 6.Tile-plot depictions of heterotetrameric adaptor complexes and clathrin repertoire in the BCCDC PHL isolates.[A] depicts the distribution of the previously identified HTACs and clathrin components in the newly assembled BCCDC PHL isolates and compared with the two pan-global reference assemblage A isolates, WB [AI] and DH [AII].No absences were identified in any assemblage A genomes. [B] depicts the distribution of adaptin, COPI, and clathrin components in the newly assembled BCCDC PHL assemblage B genomes in comparison to the two pan-global reference assemblage B isolates, GS and GS_B.Although no large assemblage-wide losses are evident, instances of individual isolate-specific absences within AP-1 subunits [i.e., AP-1μ and AP-1σ] were present.https://doi.org/10.1371/journal.pntd.0011837.g006

Fig 7 .
Fig 7. Tile-plot depictions of the giardial ARF regulatory system protein repertoire in the BCCDC PHL isolates.[A] depicts the distribution of the previously identified giardial ARF, ARF GEF, and ARF GAP repertoire in newly assembled BCCDC PHL Giardia genomes compared to the two pan-global reference assemblage A isolates, WB [AI] and DH [AII].In the previous survey, isolate WB was shown to lack both paralogues of BIG [Pipaliya et.al., 2021].However, loss within only one of the two BIGL paralogues is apparent in the new BCCDC assemblage A isolates.[B] depicts the distribution of the giardial ARF regulatory system proteins in the newly assembled BCCDC PHL assemblage B isolates in comparison to the two pan-global reference assemblage B isolates, GS and GS_B.Assemblage-wide losses within ARF1FB2 and one of the two CYTHL paralogs are also evident and consistent with the previous results.https://doi.org/10.1371/journal.pntd.0011837.g007